Zillow: Finding the Best Time to Buy a House

Name:

  • Janelle Samansky
  • Jesus Quintero
  • Oscar Loza-Corona
  • Cesar Perez-Salas

The purpose of this proposal is to analyze the Zillow housing data for the pacific coast, more specifically California, Washington, and Oregon. By seeking an answer to 1) how different the housing markets are between these three states so then to gain further understanding on 2) where in each state would be the most affordable and where would be the most expensive county to buy a house, 3) to determine when is the right time to buy a house, focusing in on if a certain season (Summer, Autumn, Winter, or Spring) often sees a decrease or increase in home purchases, and 4) briefly look at the housing bubble so as to have a more comprehensive idea of the changes within the housing market.

By comparing the estimated average housing prices for each state versus their true average housing prices it was clear California was the most expensive to live in, much like was anticipated. Taking it a step further, it was hypothesized that the population density size of a county could impact the average housing price within that county and while that appeared true in both Washington and Oregon, California’s representation of the data did not support this claim. If surrounding geographical factors were taken into consideration (i.e. location of state parks, and Silicon Valley, SF) California’s results would have possibly not been as surprising as they initially were. With the limited prior knowledge or real estate trends, it was expected that Summer would be a great time to view a house on the market, but may be more competitive and less cost effective, while Fall may be a great time to buy a house since open houses may be less frequented and therefore less competitors trying to outbid one another.

Intro

Almost everyone would consider buying a house to be an important milestone in life, and with the fact that we will be graduating from UCSB soon the thought of buying a perfect home sounds intriguing due to it being one of the next chapters of our lives. By using data found on Zillow, a popular real estate website, we will analyze the housing market of California, Oregon, and Washington to forecast average housing prices within each state, and determine the best counties to buy homes in, when is the best time to buy a house, as well as understand how other specified factors influence the decision process of buying a house.

In [32]:
import pandas as pd
import numpy as np
import math
##Must "conda install plotly" on terminal before proceeding
In [2]:
import matplotlib.pyplot as plt
import seaborn as sns
In [3]:
df_sell_price= pd.read_csv('City_Zhvi_3bedroom.csv', encoding='latin-1')

Bootstrap: Estimating the mean sale price per State

Zillow contains housing data throughout the United States, by concentrating on the housing data for California, Oregon, and Washington we focus our analysis on data that is more relevant to us as residents in California so as to gain better understanding of the varying housing markets on the West Coast. We also agreed we wanted to look at a more recent representation of the housing market after the crash, so we focus our data to contain the housing prices from the start of 2010 (January) to the end of 2018 (December).

In [4]:
CA = df_sell_price.loc[df_sell_price["State"]=="CA"]
WA = df_sell_price.loc[df_sell_price["State"]=="WA"]
OR = df_sell_price.loc[df_sell_price["State"]=="OR"]

df_pacific = pd.concat([CA,WA,OR])
pacific_part3 = df_pacific.loc[:,"2009-01":"2018-12"]
holder1 = df_pacific.iloc[:,0:6]
holder = df_pacific.loc[:,"2010-01":"2018-12"]
df_pacific = pd.concat([holder1, holder], axis = 1)

After narrowing down our data to California, Oregon, and Washington from 2010 to 2018, we use bootstrapping to estimate the average housing price for each state and compare our estimates to the true average housing prices for the states. By first defining a function that sampling with replacement, we then took the average by the rows, which in our case was by the zip codes. We then took the average of those computed means and store the results to an empty array. For each state we then utilized the bootstrap function and took the average of the outputted array and rounded our results up. This gave us our estimated housing price for each state. In order to get the true housing price, we then took the average of the states’ rows again, and averaged those means.

In [5]:
def bootstrap_house_price_mean(data, N):
    
    array = np.array(np.zeros(N))
    
    for i in range(N):
        ## Sample on our sample with replacement
        data = data.iloc[np.random.choice(len(data), len(data)),:]
    
        ## Isolate our numer data and gather a mean of the selling prices 
        numeric_data = data.loc[:,"2010-01":"2018-12"]
        bootstrap_rowmeans = np.mean(numeric_data, axis = 1)
    
        ## Append Mean to array
        array[i] = np.mean(bootstrap_rowmeans)
    
    ## return array
    return array
In [6]:
##Washington Comparison
WA_mean = math.ceil(np.mean(bootstrap_house_price_mean(WA, 1000)))
Estimated_WA_Mean = math.ceil(np.mean(np.mean(WA.loc[:,"2010-01":"2018-12"], axis = 1)))
print("The true mean for Washington is $%s" %Estimated_WA_Mean, ", while the estimated mean is $%s" %WA_mean,"given a boostrap sample of size:",len(WA),".")
The true mean for Washington is $278152 , while the estimated mean is $320342 given a boostrap sample of size: 265 .
In [7]:
#California Comparison
CA_mean = math.ceil(np.mean(bootstrap_house_price_mean(CA, 1000)))
Estimated_CA_Mean = math.ceil(np.mean(np.mean(CA.loc[:,"2010-01":"2018-12"], axis = 1)))
print("The true mean for California is $%s" %Estimated_CA_Mean, ", while the estimated mean is $%s" %CA_mean,"given a boostrap sample of size:",len(CA),".")
The true mean for California is $485392 , while the estimated mean is $208457 given a boostrap sample of size: 716 .
In [8]:
#Oregon Comparison
OR_mean = math.ceil(np.mean(bootstrap_house_price_mean(OR, 1000)))
Estimated_OR_Mean = math.ceil(np.mean(np.mean(OR.loc[:,"2010-01":"2018-12"], axis = 1)))
print("The true mean for Oregon is $%s" %Estimated_OR_Mean, ", while the estimated mean is $%s" %OR_mean,"given a boostrap sample of size:",len(OR),".")
The true mean for Oregon is $229240 , while the estimated mean is $254731 given a boostrap sample of size: 179 .
In [9]:
CA_crash = CA.loc[:,"2006-06":"2012-12"]

When we compare these results to the true housing prices, we see that we have under estimated for all the states. We could justify this under-estimation due to the fact that there can be outliers in housing prices data. For example, our California estimate was way off, and this could be because California is 1) huge in comparison to Oregon and Washington, and 2) has varied housing markets, meaning the cost to buy a house in San Francisco or Los Angeles would be greater than say buying a house in Fresno or Bakersfield. While our estimated average housing price for California was far off, our estimated average housing price for Oregon was fairly good with our estimate being 331,773' and the true value being 229,240. We see that our estimate for Washington was the best, with it being fairly close to the true price, with us off a little less than 13,000.

Lastly, we took the liberty of doing bootstrap sampling on our pacific data (which contains all three states together) so as to see what our estimated average house price of the pacific coast is and compare it to the real average price. As we see our estimate is quite off, much like our results for California. We note that the sample size for both California and for our Pacific data is much larger than our sample size for either Oregon or Washington and feel this could be a contributing factor. We also realize that since our Pacific data incorporates all the housing prices for all three states, we will be affected by the extreme outlier values that we suspect impacted our California results.

In [10]:
#Pacific Camparison
df_pacific_mean = math.ceil(np.mean(bootstrap_house_price_mean(df_pacific, 1000)))
Estimated_pacific_Mean = math.ceil(np.mean(np.mean(df_pacific.loc[:,"2010-01":"2018-12"], axis = 1)))
print("The true mean for United States Pacific Coast is $%s" %df_pacific_mean, ", while the estimated mean is $%s" %Estimated_pacific_Mean,"given a boostrap sample of size:",len(df_pacific),".")
The true mean for United States Pacific Coast is $427906 , while the estimated mean is $398521 given a boostrap sample of size: 1160 .

Also, when we compare the true pacific average housing price with our calculated estimates we achieved through bootstrapping, we see that our California estimate is the closest, with Washington second best and Oregon the farthest from the true Pacific average price with a difference of over $250,000. Again, these results makes sense due to the fact that California contains the largest sample size and therefore contributes more of the data used in the pacific estimate, and while Washington has the lowest sample size, we’d expect it to be farther off than the pacific average.

County Sold Analysis

In [11]:
!pip install geopandas
!pip install pyshp
!pip install shapely
!pip install plotly-geo

##Must "conda install plotly" on terminal before proceeding 
Requirement already satisfied: geopandas in /opt/conda/lib/python3.7/site-packages (0.6.2)
Requirement already satisfied: fiona in /opt/conda/lib/python3.7/site-packages (from geopandas) (1.8.13)
Requirement already satisfied: pandas>=0.23.0 in /opt/conda/lib/python3.7/site-packages (from geopandas) (0.25.1)
Requirement already satisfied: pyproj in /opt/conda/lib/python3.7/site-packages (from geopandas) (2.4.2.post1)
Requirement already satisfied: shapely in /opt/conda/lib/python3.7/site-packages (from geopandas) (1.6.4.post2)
Requirement already satisfied: click-plugins>=1.0 in /opt/conda/lib/python3.7/site-packages (from fiona->geopandas) (1.1.1)
Requirement already satisfied: attrs>=17 in /opt/conda/lib/python3.7/site-packages (from fiona->geopandas) (19.1.0)
Requirement already satisfied: six>=1.7 in /opt/conda/lib/python3.7/site-packages (from fiona->geopandas) (1.12.0)
Requirement already satisfied: munch in /opt/conda/lib/python3.7/site-packages (from fiona->geopandas) (2.5.0)
Requirement already satisfied: click<8,>=4.0 in /opt/conda/lib/python3.7/site-packages (from fiona->geopandas) (7.0)
Requirement already satisfied: cligj>=0.5 in /opt/conda/lib/python3.7/site-packages (from fiona->geopandas) (0.5.0)
Requirement already satisfied: python-dateutil>=2.6.1 in /opt/conda/lib/python3.7/site-packages (from pandas>=0.23.0->geopandas) (2.8.0)
Requirement already satisfied: numpy>=1.13.3 in /opt/conda/lib/python3.7/site-packages (from pandas>=0.23.0->geopandas) (1.17.2)
Requirement already satisfied: pytz>=2017.2 in /opt/conda/lib/python3.7/site-packages (from pandas>=0.23.0->geopandas) (2019.2)
Requirement already satisfied: pyshp in /opt/conda/lib/python3.7/site-packages (1.2.12)
Requirement already satisfied: shapely in /opt/conda/lib/python3.7/site-packages (1.6.4.post2)
Requirement already satisfied: plotly-geo in /opt/conda/lib/python3.7/site-packages (1.0.0)

Now that we know what the true average housing price is for each state, we want to see how the housing prices compare by county within each state. Our results above showed us that with the use of bootstrapping we were able to predict the average housing cost for Washington really well, and California less accurately. Since we already mentioned size may have contributed to this discrepancy, we would like to build on this idea and think about how the counties within each state contribute to the results we got. We believe that counties with higher population density size may also be counties with higher housing costs. While in the bootstrap part we averaged based off of zipcodes we understand that when we look at the county average housing price, we are still taking the average of the zipcodes that reside within the county. Since Washington was our best estimate, we could hypothesis that the average housing cost by counties within Washington are not so different and the population density size between counties are also fairly consistent. In contrast, we’d expect California to have a very diverse range of population density sizes when it comes to its counties, and in effect expect those with higher populations to also have an increase in housing cost, due to the fact that supply may be limited in these areas but the demand will be high and increase the prices.

In [12]:
df_sample = pd.read_csv('https://raw.githubusercontent.com/plotly/datasets/master/minoritymajority.csv')

df_sample_ca = df_sample[df_sample['STNAME'] == 'California']
df_sample_ca = df_sample_ca.reset_index()
CA = CA.reset_index()

df_sample_or = df_sample[df_sample['STNAME'] == 'Oregon']
df_sample_or = df_sample_or.reset_index()
OR = OR.reset_index()

df_sample_wa = df_sample[df_sample['STNAME'] == 'Washington']
df_sample_wa = df_sample_wa.reset_index()
WA = WA.reset_index()
In [13]:
#state_name is of type string
def state_setup(state_df, plot_state_df):
    
    #Set up average values
    state_df["average"] = round(np.mean(state_df.iloc[:,6:289], axis = 1),2)
    plot_state_df["Average_County_Sale_Price"] = np.zeros((len(plot_state_df)))
    
    #for loop for mapping dataframes
    row_num = 0
    for i in range(len(state_df)):
        for j in range(len(plot_state_df)):
            if(state_df.iloc[i,5] == plot_state_df.iloc[j,3]):
                row_num = j
        plot_state_df.iloc[row_num,-1] = round(state_df.iloc[i,-1])
    return plot_state_df
In [14]:
plot_CA = state_setup(CA, df_sample_ca)
plot_OR = state_setup(OR, df_sample_or)
plot_WA = state_setup(WA, df_sample_wa)

Density Population

In order to compare the Average House Price in Counties we decided to formulate the population density using our census data from 'df_sample' in order to see if there's any difference between the housing market and the area of each county. Below is a function that creates the density population for all three states

In [15]:
def add_density(state_data):
    total = 0
    ##Extract the total population
    for i in range(state_data.shape[0]):
        total += state_data.iloc[i,4]
        
    ##Create new column   
    state_data["Density"] = np.zeros(state_data.shape[0])
    
    ##Change 0 to correct population density
    for i in range(state_data.shape[0]):
        state_data.iloc[i,-1] = state_data.iloc[i,4]/total
        
    ##Resize density to plot {multiply by 1000}
    state_data["Density"] = state_data["Density"]*1000
In [16]:
add_density(plot_CA)
add_density(plot_OR)
add_density(plot_WA)

Washington Analysis / House Price and Density Population

In [17]:
import plotly.figure_factory as ff
import pandas as pd
#Oregon Block 
values = plot_OR["Average_County_Sale_Price"].tolist()
fips = plot_OR['FIPS'].tolist()

endpts = list(np.mgrid[min(values):max(values):9j])
colorscale =  ["#030512","#1d1d3b","#323268","#3d4b94","#3e6ab0",
              "#4989bc","#60a7c7","#85c5d3","#b7e0e4","#eafcfd"]
fig = ff.create_choropleth(
    fips=fips, values=values, scope=['Oregon'], show_state_data=True,
    colorscale=colorscale, binning_endpoints=endpts, round_legend_values=True,
    plot_bgcolor='rgb(229,229,229)',
    paper_bgcolor='rgb(229,229,229)',
    legend_title='Average House Price from 2010-2018',
    county_outline={'color': 'rgb(255,255,255)', 'width': 0.5},
    exponent_format=False,
)
fig.layout.template = None
fig.show()
In [18]:
values = plot_OR["Density"].tolist()
fips = plot_OR['FIPS'].tolist()

endpts = list(np.mgrid[min(values):max(values):9j])
colorscale = ["#030512","#1d1d3b","#323268","#3d4b94","#3e6ab0",
              "#4989bc","#60a7c7","#85c5d3","#b7e0e4","#eafcfd"]
fig = ff.create_choropleth(
    fips=fips, values=values, scope=['Oregon'], show_state_data=True,
    colorscale=colorscale, binning_endpoints=endpts, round_legend_values=True,
    plot_bgcolor='rgb(229,229,229)',
    paper_bgcolor='rgb(229,229,229)',
    legend_title='Population by County in Density rescaled to times 1000',
    county_outline={'color': 'rgb(255,255,255)', 'width': 0.5},
    exponent_format=False,
)
fig.layout.template = None
fig.show()

By using the average housing price and obtaining state census data we are able to visualize the varying housing costs cross the counties for each state. We believe the best place to buy a house that would be a highly populated county using our density, but also is also at or below that state average housing price. For Oregon we see that Multnomah, Clackamas and Washington Counties have higher population area but their housing prices are around 202,475- 303,712 dollars, obviously hitting our standard. The fact that these counties are highly populated makes sense because we know that Portland is within the Multnomah area and that is a big city within the state.

California Analysis / House Price and Density Population

In [19]:
#California Block 
values = plot_CA["Average_County_Sale_Price"].tolist()
fips = plot_CA['FIPS'].tolist()

endpts = list(np.mgrid[min(values):max(values):9j])
colorscale =  ["#030512","#1d1d3b","#323268","#3d4b94","#3e6ab0",
              "#4989bc","#60a7c7","#85c5d3","#b7e0e4","#eafcfd"]
fig = ff.create_choropleth(
    fips=fips, values=values, scope=['California'], show_state_data=True,
    colorscale=colorscale, binning_endpoints=endpts, round_legend_values=True,
    plot_bgcolor='rgb(229,229,229)',
    paper_bgcolor='rgb(229,229,229)',
    legend_title='Average House Price from 2010-2018',
    county_outline={'color': 'rgb(255,255,255)', 'width': 0.5},
    exponent_format=False,
)
fig.layout.template = None
fig.show()
In [20]:
values = plot_CA["Density"].tolist()
fips = plot_CA['FIPS'].tolist()

endpts = list(np.mgrid[min(values):max(values):9j])
colorscale = ["#030512","#1d1d3b","#323268","#3d4b94","#3e6ab0",
              "#4989bc","#60a7c7","#85c5d3","#b7e0e4","#eafcfd"]
fig = ff.create_choropleth(
    fips=fips, values=values, scope=['California'], show_state_data=True,
    colorscale=colorscale, binning_endpoints=endpts, round_legend_values=True,
    plot_bgcolor='rgb(229,229,229)',
    paper_bgcolor='rgb(229,229,229)',
    legend_title='Population by County in Density rescaled to times 1000',
    county_outline={'color': 'rgb(255,255,255)', 'width': 0.5},
    exponent_format=False,
)
fig.layout.template = None
fig.show()

When we check for California, we see that the most densely populated counties are in Southern California, more specifically they are Los Angeles, San Bernardino, Riverside, Orange and San Diego. Despite these highly populated areas the most expensive place to live was in Marin County within the Bay Area, this could possibly be explained due to the proximity to San Francisco so people that live in Marin may be able to commute to SF for work or pleasure but are outside the hectic city life. When considering a highly populated area but costs at or below California Housing average, Los Angeles county would be the best choice, and we see the Bay Area is quite expensive for the minimal population size.

Washington Analysis / House Price and Density Population

In [21]:
#Washington Block 
values = plot_WA["Average_County_Sale_Price"].tolist()
fips = plot_WA['FIPS'].tolist()

endpts = list(np.mgrid[min(values):max(values):9j])
colorscale = ["#030512","#1d1d3b","#323268","#3d4b94","#3e6ab0",
              "#4989bc","#60a7c7","#85c5d3","#b7e0e4","#eafcfd"]
fig = ff.create_choropleth(
    fips=fips, values=values, scope=['Washington'], show_state_data=True,
    colorscale=colorscale, binning_endpoints=endpts, round_legend_values=True,
    plot_bgcolor='rgb(229,229,229)',
    paper_bgcolor='rgb(229,229,229)',
    legend_title='Average House Price from 2010-2018',
    county_outline={'color': 'rgb(255,255,255)', 'width': 0.5},
    exponent_format=False,
)
fig.layout.template = None
fig.show()
In [22]:
values = plot_WA["Density"].tolist()
fips = plot_WA['FIPS'].tolist()

endpts = list(np.mgrid[min(values):max(values):9j])
colorscale = ["#030512","#1d1d3b","#323268","#3d4b94","#3e6ab0",
              "#4989bc","#60a7c7","#85c5d3","#b7e0e4","#eafcfd"]
fig = ff.create_choropleth(
    fips=fips, values=values, scope=['Washington'], show_state_data=True,
    colorscale=colorscale, binning_endpoints=endpts, round_legend_values=True,
    plot_bgcolor='rgb(229,229,229)',
    paper_bgcolor='rgb(229,229,229)',
    legend_title='Population by County in Density rescaled to times 1000',
    county_outline={'color': 'rgb(255,255,255)', 'width': 0.5},
    exponent_format=False,
)
fig.layout.template = None
fig.show()

Lastly, comparing Washington’s population by county density to its average housing rate we see that the most populated counties are King, and then Pierce and Snohomish and we see that the most expensive housing prices are within King County as well. Again, this makes sense due to the fact that Seattle is a bustling major city so we would expect the housing there to be more expensive. As for the counties that we think would be best to buy a house in, we can conclude Snohomish followed by the Pierce county. We see that Snohomish county is highly populated, and housing price range from the 154,903 – 309,807 dollar category.

As we expected, the most highly populated counties within the states also had the highest housing rates (with exception to California), but we were still able to find highly populated counties that were within the ball park of the state average.

Market House SEASONAL price

So far, we have determined the average housing price for California, Washington and Oregon as well as visualize differences in housing prices and population density with respect to counties. We have come to grasp a clearer idea of where to look for an affordable house but we are curious as to when would be a good time to buy a house? Basing our sense of seasons off of the solstices and the equinoxes we define Winter to be from December 21st – March 20th, Spring from March 21st – June 20th, Summer to June 21st – September 20th, and lastly Autumn from September 21st – December 20th. Now after clearly defining what months constitute for which season, we hypothesis that Autumn or Winter would be the best time to buy a house primarily because we think December would be the best month. Noting that it starts getting colder and rainier in all states during this time we would expect the number of competitive home buyers to drop therefore dropping the price of the houses to compensate for the lower demand. We would think Summer could be the worst time to buy because the number of people attending open houses may increase due to the better whether therefore leading to an increase in the demand of the houses we are most interested in. By doing a box and whiskers plot as well as a violin plot for each state we are able to clearly see which season has the lowest to highest prices, as well as the span of the frequency of number of house for varying prices offered within each season based off of the tally marks within our violin plots.

Seasonal Months of House Market

In [23]:
def season_split(state, season):
    ##state: type dataframe, of the state you want to split {CA,OR,WA}
    ##season: string of season {"winter","spring", "summer", "fall"}
    state_data = state.loc[:,"2010-01":"2018-12"]
    ##Intialize array
    indicies = np.array(np.zeros(27))
    
    ##Grab correct season
    if (season == "winter"):
        start = 0
    if (season == "spring"):
        start = 3
    if (season == "summer"):
        start = 6
    if (season == "fall"):
        start = 9
        
        
    i = start
    location = 0
    j = 0
    ##for loop for column numbers
    while (i < state_data.shape[1]):
        while (j < 3):
            indicies[location] = i+j
            location += 1
            j += 1
        j = 0
        i += 12
    ## make array for titles
    state_title = state.iloc[:,[2,3,5]]
    
    ##make array for data only
    state_data = state_data.iloc[:,indicies]
    ##return correct array
    return pd.concat([state_title,state_data], axis = 1)

Average House Market in Months from 2010-2019

The code below is the average house market in months from 2010-2019

In [24]:
years_09_18 = CA.iloc[:,160:280]
#month has to be initialized as an empty month 
month = pd.DataFrame()
def get_month_data(month,years,name,month_index): 
    month = pd.DataFrame()
    month_name = str(name)
    month_means = pd.DataFrame()
    for i in range(12):
        month = pd.concat([month,years.iloc[:,(i*12+(month_index-1)):(i*12+month_index)]],axis = 1)
    month_means =pd.DataFrame({month_name: np.mean(month)})
    month_means.set_index([pd.Series([2009,2010,2011,2012,2013,2014,2015,2016,2018,2019])],inplace=True)
    return month_means
In [25]:
january = get_month_data(month,pacific_part3,"January",1)
february = get_month_data(month,pacific_part3,"February",2)
march  = get_month_data(month,pacific_part3,"March",3)
april  = get_month_data(month,pacific_part3,"April",4)
may = get_month_data(month,pacific_part3,"May",5)
june = get_month_data(month,pacific_part3,"June",6)
july = get_month_data(month,pacific_part3,"July",7)
august = get_month_data(month,pacific_part3,"August",8)
september = get_month_data(month,pacific_part3,"September",9)
october = get_month_data(month,pacific_part3,"October",10)
november = get_month_data(month,pacific_part3,"November",11)
december = get_month_data(month,pacific_part3,"December",12)

California Market Price Average by Season

A violin plot plays a similar role as a box and whisker plot. It shows the distribution of quantitative data across several levels of one (or more) categorical variables such that those distributions can be compared. Unlike a box plot, in which all of the plot components correspond to actual datapoints, the violin plot features a kernel density estimation of the underlying distribution.

In [26]:
import seaborn as sns
%matplotlib inline

CA_wint = season_split(CA, "winter").mean().values
CA_spring = season_split(CA, "spring").mean().values
CA_summer = season_split(CA, "summer").mean().values
CA_fall = season_split(CA, "fall").mean().values

Seasons_CA = {"winter":CA_wint,
          "Spring":CA_spring,
          "Summer":CA_summer,
          "Fall":CA_fall}

plt.figure(figsize=(15,5))
plt.subplot(1,2,1)
test_CA = pd.DataFrame((Seasons_CA))
sns.boxplot(data=test_CA,palette='rainbow',orient='h')

plt.subplot(1,2,2)
sns.violinplot(data=test_CA,palette='rainbow')
sns.swarmplot(data=test_CA)
Out[26]:
<matplotlib.axes._subplots.AxesSubplot at 0x7ffb47708a90>

When we look at the seasonal analysis for California, we see that Winter seems to be the best time to buy a house, due to the fact the box for winter is comparatively smaller than all the other seasons. We also see that Fall is the worst time for us to buy a house with the average cost of a house in Fall being much larger than any of the other average costs for any of the other seasons. In regards to our violin plot for California, we see that Winter has a larger shape especially in the 400,000 dollar range while in Summer and Fall the shape remains fairly constant. This shows us that we have higher probability of buying a home in winter for around 400,000 dollars than any of the other seasons, and the odds of buying a home is Autumn is fairly constant despite the price.

Oregon Market Price Average by Season

In [27]:
OR_wint = season_split(OR, "winter").mean().values
OR_spring = season_split(OR, "spring").mean().values
OR_summer = season_split(OR, "summer").mean().values
OR_fall = season_split(OR, "fall").mean().values

Seasons = {"winter":OR_wint,
          "Spring":OR_spring,
          "Summer":OR_summer,
          "Fall":OR_fall}

plt.figure(figsize=(15,5))
plt.subplot(1,2,1)
test_OR = pd.DataFrame((Seasons))
sns.boxplot(data=test_OR,palette='rainbow',orient='h')

plt.subplot(1,2,2)
sns.violinplot(data=test_OR,palette='rainbow')
sns.swarmplot(data=test_OR)
Out[27]:
<matplotlib.axes._subplots.AxesSubplot at 0x7ffb478a91d0>

We see that the results for Oregon, are fairly consistent with the results we obtained for California. We see again that Winter is a better time for use to buy an affordable house. Again, Fall is the most expensive time to buy, but we do see that Spring also has a fairly high probability of buying a home around 200,000 dollars, and as denoted by the tick marks may even have more variety of choice when it comes to choosing a house.

Washington Market Price Average by Season

In [28]:
WA_wint = season_split(WA, "winter").mean().values
WA_spring = season_split(WA, "spring").mean().values
WA_summer = season_split(WA, "summer").mean().values
WA_fall = season_split(WA, "fall").mean().values

Seasons_WA = {"winter":WA_wint,
          "Spring":WA_spring,
          "Summer":WA_summer,
          "Fall":WA_fall}

plt.figure(figsize=(15,5))
plt.subplot(1,2,1)
test_WA = pd.DataFrame((Seasons_WA))
sns.boxplot(data=test_WA,palette='rainbow',orient='h')

plt.subplot(1,2,2)
sns.violinplot(data=test_WA,palette='rainbow')
sns.swarmplot(data=test_WA)
Out[28]:
<matplotlib.axes._subplots.AxesSubplot at 0x7ffb4e745da0>

Lastly, Washington data compliments the results the results of the last two states. We see from our violin plots that winter is the best time giving us higher probability of finding a house ranging from 250,000 to 300,000 dollars. Also, Spring does have more variety in the amount of houses offered at varying prices.

It is clear that Winter (December 21st – March 20th) is a great time to buy a house, but in the same regards Spring is favorable especially we would like to have more options to choose from in the housing market. While Spring has minimal increase in housing prices, it also appears to have more houses on the market spread throughout the varying housing prices.

Monthly Pacific Market 2010-2018

In [29]:
pacific_month_avg = pd.concat([january,february,march,april,may,june,july,august,september,october,november,december],axis = 1)
pacific_month_avg.index.name='year'
pacific_month_avg
Out[29]:
January February March April May June July August September October November December
year
2009 362437.699115 358860.973451 354770.265487 350365.398230 345654.336283 341364.513274 338536.548673 336905.752212 335949.469027 335512.300885 335509.469027 335601.681416
2010 335531.150442 335959.203540 335365.575221 333894.867257 332803.982301 331881.858407 329462.610229 327605.643739 325542.239859 323244.797178 320901.058201 318295.414462
2011 315815.418502 313050.660793 310848.281938 308888.017621 307120.000000 305683.964758 304164.379947 302962.972735 301742.568162 300831.750220 300213.456464 299649.956025
2012 299246.262093 299358.047493 300239.929639 301931.750220 304172.559367 306411.257696 308675.021988 311097.537379 313850.747581 316790.237467 319940.017590 323240.017590
2013 326534.265734 330319.842657 335232.342657 340827.097902 346263.199301 351490.734266 356024.432810 359865.619546 363050.698080 365870.418848 368350.959860 370830.890052
2014 373084.708949 375605.386620 377947.523892 380200.000000 382276.542137 384356.038228 386057.192374 388264.904679 390533.535529 392999.826690 395543.240901 398207.452340
2015 402779.220779 406084.242424 409625.281385 413257.402597 416757.316017 420014.199134 423139.273356 426441.262976 429485.553633 431924.134948 434263.667820 436895.934256
2016 438787.844828 441453.017241 444517.931034 447442.844828 449856.637931 452263.965517 454976.896552 457643.017241 460545.344828 463813.534483 467320.344828 470907.241379
2018 474274.310345 477565.948276 481118.189655 484598.793103 487605.258621 490381.034483 493219.482759 496088.189655 499222.155172 502701.120690 506237.068966 509376.896552
2019 513571.724138 518893.534483 522957.241379 525436.206897 528109.396552 530753.448276 532637.672414 534348.189655 536593.879310 538950.172414 540477.758621 541395.344828

Monthly BoxPlot on Pacific

With these results in mind, we wanted to do a comparative of the months, since we saw roughly the same results for all three states, we did this analysis on the Pacific Coast dataset. We see that May, April, and March appear to be the best months to buy with May having the lowest average cost of all the months. This further solidifies our conclusion that Spring would be overall the best season to buy a house.

In [30]:
plt.figure(figsize=(15,10))
sns.boxplot(data=pacific_month_avg,palette='rainbow',orient='h')
Out[30]:
<matplotlib.axes._subplots.AxesSubplot at 0x7ffb4e676278>

United States housing bubble

Despite figuring out the most cost effective places to buy a house within California, Washington, and Oregon as well as the best season, we are still apprehensive in regards to if buying right now or in the next few years would be a sound investment. We remember the housing bubble that occurred from early 2006 to 2012. In early 2006, the housing prices peaked and then resulted in a decline until it hit it’s lowest in 2012, and even before that in 2008 had the largest price drop in history which resulted in affecting the 2007-2009 recession (Housing Bubble). We want to see if looking at the average housing price from 2000 -2019 can give us some indication of whether or not we are in a peak or decline in the housing market currently, so as to have a better understanding of if now or the immediate future is the best time to buy.

In [31]:
import seaborn as sns
%matplotlib inline

plt.figure(figsize=(250,50), dpi=80, facecolor='w',edgecolor='k')
mean_CA = np.mean(CA.iloc[:,56:290])
year_month=(np.mean(CA.iloc[:,56:290]).index).tolist()
y_pos = np.arange(len(year_month))
plt.plot(year_month, mean_CA,color='mediumvioletred',linewidth=10)
plt.title("average home prices (2000-2019)",fontsize=12)
plt.xticks(y_pos, year_month,fontsize=30, rotation=90)
#plt.xlabel("date")
plt.ylabel("average house price")
plt.show()

From our Zillow data, we see that there was an increase from 2000 to early 2006, and we can also see the sharp decline in prices from 2006-2012. After being able to identify the housing bubble incident, we can see that the housing market has been steadily climbing higher and higher and peak has even surpassed the peak we witnessed in 2006. We also see at the tail a faint decrease, leading us to believe that the housing market is or will in the near future experience a fall again. We understand that there are a multitude of factors that go into creating an impact on the housing market so it would be difficult to try to forecast when and for how long we may experience this drop, therefore we leave it say that we believe the market may continue to decrease, and when more data becomes available to us we may be able to create a more definitive answer.

Conclusion

When analyzing the house market it was challenging to filter and find data since most websites only release a limited amount. Our Zillow data had county means by time series which was a challenge to get any information for that. But in conclusion we were able to get the population density from census and approximate some of our limited data. In addition, When it comes to buying a home, it is clear that a lot of factors go into the decision making process, we decided to focus primarily on cost, more specific how the cost for buying a house varies not only between three different states but within each state, as well as when is the best time to buy. In doing so we have gained a better understanding of the housing market and more insight on the cost dynamic but we have also disregarded some other key aspects that go into the decision process when buying a house. We have performed these test with the expectation that we already knew we wanted to buy a home with 3-bedrooms, so those more direct factors are not what we are referring too, rather we understand that the geography of where we choose to buy often impacts whether we buy and sometimes even the price of the house. In the future, we may look to pin point which houses are within a certain mile radius of a specific branch or hospital, or maybe which houses are around a downtown area that has a fun/lively night life, or even create a colormap that plots crime rate in the last 5 years so as to be sure we choose a safe neighborhood.

REFERENCE

[SOURCES]

“Housing Data.” Zillow Research, https://www.zillow.com/research/data/.

“United States Housing Bubble.” Investopedia, investopedia, 19 Nov. 2019, https://www.investopedia.com/terms/h/housing_bubble.asp.

“USA County Choropleth Maps.” USA County Choropleth Maps | Python | Plotly, https://plot.ly/python/county-choropleth/.